Objective:
To categorize the countries using socio-economic and health factors that determine the overall development of the country.
Problem Statement:
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is to categorise the countries using socio-economic and health factors that determine their overall development, and then to suggest the countries the CEO should focus on most.
Context:
HELP International is an international humanitarian NGO that is committed to fighting poverty and providing the people of backward countries with basic amenities and relief during the time of disasters and natural calamities.
Libraries: e1071, tidyverse, plotly, htmltools, devtools, caret, NbClust, reshape2, rvest, magrittr, stringr, cowplot, ggmap
DATA DICTIONARY
* country: Name of the country
* child_mort: Deaths of children under 5 years of age per 1,000 live births
* exports: Exports of goods and services per capita, given as a percentage of GDP per capita
* health: Total health spending per capita, given as a percentage of GDP per capita
* imports: Imports of goods and services per capita, given as a percentage of GDP per capita
* income: Net income per person
* inflation: The measurement of the annual growth rate of the total GDP
* life_expec: The average number of years a newborn child would live if current mortality patterns were to remain the same
* total_fer: The number of children that would be born to each woman if current age-fertility rates were to remain the same
* gdpp: GDP per capita, calculated as the total GDP divided by the total population
Peek at the First Six Rows
## # A tibble: 6 × 10
## country child_mort exports health imports income inflation life_expec
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Afghanistan 90.2 10 7.58 44.9 1610 9.44 56.2
## 2 Albania 16.6 28 6.55 48.6 9930 4.49 76.3
## 3 Algeria 27.3 38.4 4.17 31.4 12900 16.1 76.5
## 4 Angola 119 62.3 2.85 42.9 5900 22.4 60.1
## 5 Antigua and Barbuda 10.3 45.5 6.03 58.9 19100 1.44 76.8
## 6 Argentina 14.5 18.9 8.1 16 18700 20.9 75.8
## # … with 2 more variables: total_fer <dbl>, gdpp <dbl>
Data Dimensions
## Shape: 167 10
## Columns: country child_mort exports health imports income inflation life_expec total_fer gdpp
## Country Labels: 'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda' ...
…
## Total Missing Values: 0
…
No chr variables to convert to factor
## tibble [167 × 9] (S3: tbl_df/tbl/data.frame)
## $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
## $ exports : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
## $ health : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
## $ imports : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
## $ income : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
## $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
## $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
## $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
## $ gdpp : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
…
…
## # A tibble: 6 × 9
## child_mort exports health imports income inflation life_expec total_fer
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.426 0.0495 0.359 0.258 0.00805 0.126 0.475 0.737
## 2 0.0682 0.140 0.295 0.279 0.0749 0.0804 0.872 0.0789
## 3 0.120 0.192 0.147 0.180 0.0988 0.188 0.876 0.274
## 4 0.567 0.311 0.0646 0.246 0.0425 0.246 0.552 0.790
## 5 0.0375 0.227 0.262 0.338 0.149 0.0522 0.882 0.155
## 6 0.0579 0.0940 0.391 0.0916 0.145 0.232 0.862 0.192
## # … with 1 more variable: gdpp <dbl>
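The scaled values above all fall between 0 and 1, which suggests min-max normalization. The scaling code itself is elided in the source, so the following is only a minimal sketch under that assumption. The helper name `min_max` and the toy data frame are hypothetical, and because the toy table holds just six rows its scaled values will not match the output above.

```r
# Min-max scaling: maps each numeric column onto [0, 1].
# `min_max` is a hypothetical helper name, not taken from the original notebook.
min_max <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Toy stand-in for the real data (two columns from the first rows shown earlier).
toy <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700)
)

# Apply the helper column by column, keeping the data-frame shape.
toy_scaled <- as.data.frame(lapply(toy, min_max))
print(round(toy_scaled, 3))
```

Putting every variable on the same scale matters for k-means because the algorithm relies on Euclidean distance; without scaling, large-valued columns such as income and gdpp would dominate the clustering.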
…
# Initial k-means fit with two clusters (Lloyd's algorithm, at most 30 iterations)
country_kmeans <- kmeans(
  countries,
  centers = 2,
  algorithm = "Lloyd",
  iter.max = 30
)
Evaluate Cluster Quality
## Variance Explained: 0.393483
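The source does not show how "Variance Explained" is computed. One common definition, assumed here, is the ratio of the between-cluster sum of squares to the total sum of squares, both of which `kmeans()` returns. The sketch below uses the built-in `USArrests` data as a stand-in for the notebook's `countries` table, and the seed is arbitrary.

```r
# Stand-in data (assumption): built-in USArrests, min-max scaled per column.
scaled <- as.data.frame(
  lapply(USArrests, function(x) (x - min(x)) / (max(x) - min(x)))
)

set.seed(42)  # arbitrary seed so the random initial centers are reproducible
fit <- kmeans(scaled, centers = 2, algorithm = "Lloyd", iter.max = 30)

# Share of the total sum of squares captured by the separation between clusters.
variance_explained <- fit$betweenss / fit$totss
cat("Variance Explained:", round(variance_explained, 6), "\n")
```

Under this definition the ratio rises as clusters are added and reaches 1 when every point is its own cluster, so it is best read alongside a method for choosing the number of clusters rather than maximized on its own.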
…
Load Map Data
## long lat group order region subregion
## 1 -69.89912 12.45200 1 1 Aruba <NA>
## 2 -69.89571 12.42300 1 2 Aruba <NA>
## 3 -69.94219 12.43853 1 3 Aruba <NA>
## 4 -70.00415 12.50049 1 4 Aruba <NA>
## 5 -70.06612 12.54697 1 5 Aruba <NA>
## 6 -70.05088 12.59707 1 6 Aruba <NA>
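To colour a map by cluster, the cluster labels have to be attached to the polygon rows by country name. The join used in the notebook is not shown, so this is a hedged sketch using base R's `merge()` on toy data. The coordinates and cluster numbers are made up for illustration, and `map_rows` and `clusters` are hypothetical names.

```r
# Toy polygon rows in the same column layout as the map data printed above.
map_rows <- data.frame(
  long   = c(-69.90, -69.89, 60.5, 61.2),
  lat    = c(12.45, 12.42, 33.7, 35.6),
  group  = c(1, 1, 2, 2),
  order  = c(1, 2, 3, 4),
  region = c("Aruba", "Aruba", "Afghanistan", "Afghanistan"),
  stringsAsFactors = FALSE
)

# Hypothetical cluster assignments, one row per country.
clusters <- data.frame(
  country = c("Afghanistan", "Albania"),
  cluster = c(1L, 2L),
  stringsAsFactors = FALSE
)

# Left join: keep every polygon row; regions without a cluster get NA.
joined <- merge(map_rows, clusters,
                by.x = "region", by.y = "country", all.x = TRUE)

# merge() reorders rows, so restore the drawing order of the polygon points.
joined <- joined[order(joined$order), ]
print(joined)
```

Regions that do not appear in the clustered data (Aruba in this toy case) keep an `NA` cluster. Country names in the two tables may be spelled differently, so it is worth checking for unmatched names before plotting.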
Visualize Socio-Economic Clusters
…
Elbow Method
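The elbow method fits k-means for a range of k and plots the total within-cluster sum of squares against k; the "elbow" is the point after which extra clusters yield only small reductions. The notebook's own code is not shown, so this is a minimal sketch on the built-in `USArrests` data as a stand-in, using the default `kmeans()` algorithm and an arbitrary seed and range of k.

```r
# Stand-in data (assumption): built-in USArrests, min-max scaled per column.
scaled <- as.data.frame(
  lapply(USArrests, function(x) (x - min(x)) / (max(x) - min(x)))
)

set.seed(42)  # arbitrary seed for reproducible random starts
k_values <- 1:8

# Total within-cluster sum of squares for each candidate k,
# keeping the best of 10 random starts per k.
wss <- sapply(k_values, function(k) {
  kmeans(scaled, centers = k, nstart = 10, iter.max = 30)$tot.withinss
})

# Look for the k where the curve bends and flattens out.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```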
NbClust Method
…
# Final k-means fit with three clusters
final_kmeans <- kmeans(
  countries,
  centers = 3,
  algorithm = "Lloyd",
  iter.max = 30
)
Evaluate Cluster Quality
## Variance Explained: 0.547986
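To decide which cluster holds the countries in most need of aid, one option is to compare the per-cluster means of the original, unscaled variables (for the country data, a cluster with high child_mort and low income and gdpp would be the natural candidate). This profiling step does not appear in the source; the sketch below is an assumption about how it could be done, again using `USArrests` as a stand-in and an arbitrary seed.

```r
# Stand-in data (assumption): built-in USArrests, min-max scaled per column.
scaled <- as.data.frame(
  lapply(USArrests, function(x) (x - min(x)) / (max(x) - min(x)))
)

set.seed(42)  # arbitrary seed for reproducibility
fit <- kmeans(scaled, centers = 3, algorithm = "Lloyd", iter.max = 30)

# Mean of every original (unscaled) variable within each cluster.
profile <- aggregate(USArrests, by = list(cluster = fit$cluster), FUN = mean)
print(profile)

# Number of observations assigned to each cluster.
cluster_sizes <- table(fit$cluster)
print(cluster_sizes)
```

Reading the profile in original units keeps the interpretation concrete, since the scaled values used for clustering have no direct real-world meaning.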
Visualize Socio-Economic Clusters
…
…